Data Visualization

Motivation, Foundations, Reading Closely and Common Errors

by Ayush Patel

delivered at Azim Premji University, Bhopal

2025-10-22

Hello!

  • I am Ayush.
  • I work at the intersection of data, policy and development.
  • I sometimes teach data analysis skills using R to those who will suffer me.

What is in it for you?

“The numbers have no way of speaking for themselves. We speak for them. We imbue them with meaning” - Nate Silver

“It is my expression that gave them an identity; otherwise,
words wandered like vagrants in search of meaning” - Madan Mohan Danish

What is the talk about?

  • Why do we need data viz?
  • A quick review of fundamentals: Variables and appropriate visualization
  • Markers of Good* data viz
  • Reading Closely
  • Common Errors
  • Resource Recommendations

What is not covered?

This is my EVIL Plan:

  • Tell them what good visualization is
  • Tell them what bad visualization is
  • But do not reveal how to actually make good viz in a consistent and efficient manner. [Cue diabolical laughter from Despicable Me]

Motivation

Can you tell me something?

The following is the number of times I daydreamed about food each day, recorded over two weeks:

 [1] 24 25 26 19 24 30 27 29 26 27 28 26 27 20

Here is the number of times I tried to complete this presentation over the same period of time:

 [1] 26 24 29 14 26 28 21 28 28 35 31 19 24 19

It is neither easy nor intuitive to look at a set of numbers and say something about them, even when summary statistics are provided. Let us look at this more closely.
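To see how little the summary statistics say, here is a minimal sketch in Python (not the R used for the outputs in these slides) that summarizes the two series above. The names `daydreams`, `attempts` and `summarize` are my own, chosen for illustration.

```python
from statistics import mean, stdev

# Daily counts copied from the two series above (14 days each).
daydreams = [24, 25, 26, 19, 24, 30, 27, 29, 26, 27, 28, 26, 27, 20]
attempts = [26, 24, 29, 14, 26, 28, 21, 28, 28, 35, 31, 19, 24, 19]


def summarize(values):
    """Return (mean, sample standard deviation, minimum, maximum)."""
    return mean(values), stdev(values), min(values), max(values)


for name, series in [("daydreams", daydreams), ("attempts", attempts)]:
    avg, sd, low, high = summarize(series)
    print(f"{name:>9}: mean={avg:.2f} sd={sd:.2f} min={low} max={high}")
```

The two means come out close to each other, yet the ranges differ; a handful of numbers like these still does not show the day-to-day shape that a simple plot would.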

You all know of the quartet - Anscombe and friends

   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
# A tibble: 4 × 3
  variable  mean    sd
  <chr>    <dbl> <dbl>
1 y1        7.50  2.03
2 y2        7.50  2.03
3 y3        7.5   2.03
4 y4        7.50  2.03

Corr x1, y1: 0.8164205
Corr x2, y2: 0.8162365
Corr x3, y3: 0.8162867
Corr x4, y4: 0.8165214
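The summary above can be reproduced with a short sketch. This one is in Python with only the standard library, whereas the output shown was produced in R; the helper `pearson` and the variable names are assumptions of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

# Anscombe's quartet, copied from the table above.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x1, x2 and x3 are identical
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


quartet = {"1": (x123, y1), "2": (x123, y2), "3": (x123, y3), "4": (x4, y4)}
for label, (xs, ys) in quartet.items():
    print(f"set {label}: mean(y)={mean(ys):.2f} sd(y)={stdev(ys):.2f} "
          f"corr={pearson(xs, ys):.4f}")
```

All four sets print nearly the same mean, standard deviation and correlation, which is exactly why the numbers alone cannot tell the four datasets apart.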

Things are not always as they seem

Anscombe’s quartet, from Data Visualization by Healy

Clearly, looking at data helps

  • Helps build an intuitive understanding of the data
  • Helps identify patterns, sometimes expected, sometimes unexpected
  • Conveys a lot of information in a concise, accessible and memorable manner
  • All of these points hold for the people making a visualization as well as those reading it

Fundamentals

Variables - What?

A record of any measurement of interest.

  • Total fans in every room of APU, Bhopal
  • Proportion of left-handed people in every class at APU, Bhopal
  • Number of students who ate lunch in the mess every Wednesday
  • Ice cream flavours sold every day at a retail shop
  • Names of people who think potato belongs in Biryani

Variables - Type

  1. Continuous: Amount of protein in every meal (Hello to those gym-going fools)
  2. Discrete/Count: Number of buffalos owned by each household in Balasore
  3. Ordinal: How was the Biryani? - very bad, bad, ok, good, very good
  4. Categorical: Types of Biryani

Can you come up with any fun examples?

Categorical and Ordinal Data - Contingency and Frequency Tables

Table 1: Government Primary Schools in Gujarat
(a) A Frequency table
Govt Primary School (Status A(1)/NA(2))        n  percent  valid_percent
1                                          17379    95.4%          97.4%
2                                            464     2.5%           2.6%
NA                                           382     2.1%              -
Total                                      18225        -              -
(b) A Contingency table
Govt Primary School (Status A(1)/NA(2))            a            b           c              NA_
1                                           0.0% (0)     0.0% (0)    0.0% (0)  100.0% (17,379)
2                                        49.4% (229)  31.7% (147)  16.4% (76)        2.6% (12)
NA                                          0.0% (0)     0.0% (0)    0.0% (0)     100.0% (382)
Total                                    49.4% (229)  31.7% (147)  16.4% (76)  202.6% (17,773)
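The difference between `percent` and `valid_percent` in the frequency table is only a change of denominator. Here is a minimal sketch in Python, assuming the counts from Table 1(a); the function `frequency_table` is hypothetical and is not the tool that produced the table.

```python
def frequency_table(counts, missing_label="NA"):
    """Return rows of (label, n, percent, valid_percent).

    percent uses all records as the denominator; valid_percent drops the
    missing records from the denominator and is None for the missing row.
    """
    total = sum(counts.values())
    valid_total = total - counts.get(missing_label, 0)
    rows = []
    for label, n in counts.items():
        percent = 100 * n / total
        if label == missing_label:
            valid_percent = None
        else:
            valid_percent = 100 * n / valid_total
        rows.append((label, n, percent, valid_percent))
    return rows


# Counts copied from Table 1(a).
counts = {"1": 17379, "2": 464, "NA": 382}
table = frequency_table(counts)

for label, n, percent, valid_percent in table:
    valid_text = "-" if valid_percent is None else f"{valid_percent:.1f}%"
    print(f"{label:>5} {n:>6} {percent:5.1f}% {valid_text:>6}")
```

Reporting both columns lets the reader see how much of the data is missing, instead of quietly hiding it in the denominator.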

Categorical and Ordinal Data - Bar Chart

Figure 1: Distance to Government Primary Schools in Gujarat
(a) Counts
(b) Proportion

Categorical and Ordinal Data - Bar Chart (2.1)

Figure 2: Access to Primary Health Centre in Gujarat - Stacked

Categorical and Ordinal Data - Bar Chart (2.2)

Figure 3: Access to Primary Health Centre in Gujarat - Grouped

Categorical and Ordinal Data - Bar Chart (2.3)

Figure 4: Access to Primary Health Centre in Gujarat - Proportion

Framing

  1. “Framing can affect how we feel about the numbers shown”
  2. The same statistic can be shown in a negative frame or a positive frame.
  3. “US reports mortality rates of child heart surgery, while UK provides survival rates.”
  4. “Ideally both negative and positive frames should be presented if we want to provide impartial information…”

Framing - Table

Table 2: Framing, from The Art of Statistics
Hospital                      Operations  Survivors  Deaths  ThirtyDaySurvival  PercentageDying
London - Harley Street               418        413       5               98.8              1.2
Leicester                            607        593      14               97.7              2.3
Newcastle                            668        653      15               97.8              2.2
Glasgow                              760        733      27               96.3              3.7
Southampton                          829        815      14               98.3              1.7
Bristol                              835        821      14               98.3              1.7
Dublin                               983        960      23               97.7              2.3
Leeds                               1038       1016      22               97.9              2.1
London - Brompton                   1094       1075      19               98.3              1.7
Liverpool                           1132       1112      20               98.2              1.8
London - Evelina                    1220       1185      35               97.1              2.9
Birmingham                          1457       1421      36               97.5              2.5
London - Great Ormond Street        1892       1873      19               99.0              1.0
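Both frames come from the same two counts. A minimal sketch in Python, using three rows copied from Table 2; the helper `both_frames` and the choice of rows are mine, and the sketch does not claim to reproduce how the table's percentages were originally computed.

```python
# (hospital, operations, deaths) for three rows copied from Table 2.
hospitals = [
    ("London - Harley Street", 418, 5),
    ("Leicester", 607, 14),
    ("London - Great Ormond Street", 1892, 19),
]


def both_frames(operations, deaths):
    """Return (survival %, mortality %) for the same counts, to one decimal."""
    survivors = operations - deaths
    survival_pct = round(100 * survivors / operations, 1)
    mortality_pct = round(100 * deaths / operations, 1)
    return survival_pct, mortality_pct


for name, operations, deaths in hospitals:
    survival_pct, mortality_pct = both_frames(operations, deaths)
    print(f"{name:<28} survival {survival_pct}%  mortality {mortality_pct}%")
```

A survival rate of 98.8% and a mortality rate of 1.2% describe the same hospital; which one is shown changes how the number feels, not what it says.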

Comparing Proportions

from: The Art of Statistics

  1. The International Agency for Research on Cancer (IARC) classified processed meats as Group 1 carcinogens.
  2. This led to poor headlines.
  3. “50g of processed meat a day was associated with an increased risk of bowel cancer by 18%”
  4. Would you stop eating bacon?

Absolute or Relative

The example mentions 18%. That sounds like a big enough number to worry about getting bowel cancer. BUT this is not the absolute difference in risk. It is the relative increase in risk for people who eat 50g of processed meat every day, compared with people who do not.

  1. People who do not consume processed meats have a 6% chance of getting bowel cancer.
  2. Similar people, if they ate processed meats every day of their lives, would have an 18% higher chance of bowel cancer.
  3. This does not mean: 6% + 18% = 24%
  4. This means: 6% × 1.18 ≈ 7%
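The arithmetic in the list above can be written out as a small sketch. It is in Python, the 6% baseline and 18% relative increase are the figures quoted above, and the function name `risk_with_exposure` is my own.

```python
def risk_with_exposure(baseline, relative_increase):
    """Apply a relative increase (0.18 for 18%) to a baseline risk (0.06 for 6%)."""
    return baseline * (1 + relative_increase)


baseline_risk = 0.06      # risk without daily processed meat
relative_increase = 0.18  # the headline figure

exposed_risk = risk_with_exposure(baseline_risk, relative_increase)
absolute_increase = exposed_risk - baseline_risk

print(f"risk with daily processed meat: {exposed_risk:.2%}")
print(f"absolute increase: {absolute_increase * 100:.2f} percentage points")
```

The relative increase is 18%, but the absolute increase is only about one percentage point: roughly 6 in 100 people becoming roughly 7 in 100.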

Continuous and categorical Variables

Village population across districts

Continuous Variable

Distance to closest town

Continuous Variables

Comparing population and geographical area of villages

Attempts to identify good visualizations

Can we tell an effective visualization from a bad one?

Napoleon’s retreat from Russia by Minard, from Data Visualization by Healy

‘Monstrous Costs’ by Nigel Holmes, from Data Visualization by Healy

Can we tell an effective visualization from a bad one?

Rainfall in Glasgow and Edinburgh, from Cara Thompson’s More than pretty graphs

Rainfall in Glasgow and Edinburgh, from Cara Thompson’s More than pretty graphs

Tufte on Visualization

“Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51).”

Markers of a good visualization

  • Not just about how it looks, though this makes the graph memorable
  • Depends on who is looking at it
  • Depends on why they are looking at it, i.e., what they expect from the chart
  • Essentially, you need both good taste and an understanding of how human visual perception works. The latter can be learned with practice, in relatively less time than the former. Good taste needs to be developed, and that takes time.

A discussion on this figure

Reference to the NYT graph, from Data Visualization by Healy

Identifying features of bad visualizations

The following problems are distinct but can appear in various combinations in a given figure.

  • Strictly Aesthetic
  • Substantive - the data presented is somehow off
  • Perceptual

Back to Democracy - what are the good things?

Reference to the NYT graph, from Data Visualization by Healy

What if I tell you

  • These are not the responses from a longitudinal survey
  • Actually, it is the same question asked of people born in different decades, i.e., different age groups.
  • The cherry on top: it was not a binary question
  • Identifying substantive problems requires understanding the data underlying a chart and being observant of any transformations and their consequent effects.

A good samaritan

Voeten’s response to the NYT graph, from Data Visualization by Healy

One for the finance bros

Liz Ann Sonders, Chief Investment Strategist with Charles Schwab, Inc., from Data Visualization by Healy

What if it was shown this way

Healy’s examples for possible manipulations from Data Visualization by Healy

Healy’s examples for possible manipulations from Data Visualization by Healy

How to address such issues - Healy’s Alternative

Healy’s Alternative to the index vs money base chart from Data Visualization by Healy

Garden variety mistakes - Single variable distributions

Bin width example from Fundamentals of Data Visualization by Wilke

Garden variety mistakes - Single variable distributions

Incorrect data representation example from Fundamentals of Data Visualization by Wilke

Garden variety mistakes - two or more variable distributions

Multiple Distribution common error from Fundamentals of Data Visualization by Wilke

Multiple Distribution common error from Fundamentals of Data Visualization by Wilke

Alternatives from Wilke

for multi variable distribution

Alternatives for Multiple Distribution from Fundamentals of Data Visualization by Wilke

Garden variety mistakes - Unordered bar charts

Resources -

Thank you.